Abstract

Tuning hyperparameters of learning algorithms
is hard because gradients are usually unavailable.
We compute exact gradients of cross-validation
performance with respect to all hyperparameters
by chaining derivatives backwards through the
entire training procedure. These gradients allow
us to optimize thousands of hyperparameters,
including step-size and momentum schedules,
weight initialization distributions, richly
parameterized regularization schemes, and neural
network architectures. We compute hyperparameter
gradients by exactly reversing the dynamics
of stochastic gradient descent with momentum.

1. Introduction 

Machine learning systems abound with hyperparameters.
These can be parameters that control model complexity,
such as L1 and L2 penalties, or parameters that specify the
learning procedure itself - step sizes, momentum decay
parameters and initialization conditions. Choosing the best
hyperparameters is both crucial and frustratingly difficult.

The current gold standard for hyperparameter selection
is gradient-free model-based optimization (Snoek et al.,
2012; Bergstra et al., 2011; 2013; Hutter et al., 2011).
Hyperparameters are chosen to optimize the validation loss
after complete training of the model parameters. These
approaches have demonstrated that automatic tuning of
hyperparameters can yield state-of-the-art performance.
However, in general they are not able to effectively
optimize more than 10 to 20 hyperparameters.

Why not use gradients? Reverse-mode differentiation
allows gradients to be computed with a similar time cost to
the original objective function. This approach is taken
almost universally for optimization of elementary¹
parameters. The problem with taking gradients with respect to
hyperparameters is that computing the validation loss requires
an inner loop of elementary optimization, which makes
naive reverse-mode differentiation infeasible from a
memory perspective. Section 2 describes this problem and
proposes a solution, which is the main technical contribution
of this paper.

Figure 1. Hyperparameter optimization by gradient descent. Each
meta-iteration runs an entire training run of stochastic gradient
descent to optimize elementary parameters (weights 1 and 2).
Gradients of the validation loss with respect to hyperparameters are
then computed by propagating gradients back through the
elementary training iterations. Hyperparameters (in this case, learning
rate and momentum schedules) are then updated in the direction
of this hypergradient.

*The order of these two authors is random. See
github.com/hips/author-roulette

¹Since this paper is about hyperparameters, we use “elementary”
to unambiguously denote the other sort of parameter, the
“parameter-that-is-just-a-parameter-and-not-a-hyperparameter”.

Gaining access to gradients with respect to hyperparameters
opens up a garden of delights. Instead of straining to
eliminate hyperparameters from our models, we can
embrace them, and richly hyperparameterize our models. Just
as having a high-dimensional elementary parameterization
gives a flexible model, having a high-dimensional
hyperparameterization gives flexibility over model classes,
regularization, and training methods. Section 3 explores these
new opportunities.

1.1. Contributions 

• We give an algorithm that exactly reverses stochastic
gradient descent with momentum to compute gradients
with respect to all continuous training parameters.

• We show how to efficiently store only the information
needed to exactly reverse learning dynamics. For example,
when the momentum term is 0.9, this method
reduces the memory requirements of reverse-mode
differentiation of hyperparameters by a factor of 200.

• We show that these gradients allow optimization of
validation loss with respect to thousands of
hyperparameters. For example, we optimize fine-grained
learning-rate schedules, per-layer initialization
distributions of neural network parameters, per-input
regularization schemes, and per-pixel data preprocessing.

• We provide insight into learning procedures by
examining optimized learning-rate schedules and
initialization procedures, comparing them to standard advice in
the literature.

2. Hypergradients 

Reverse-mode differentiation (RMD) has been an asset to 
the field of machine learning (LeCun et al., 1989) (see the 
7 for a refresher). The RMD method, known as
“back-propagation” in the deep learning community, allows the
gradient of a scalar loss with respect to its parameters to
be computed in a single backward pass. This increases the
computational burden by only a factor of two over evaluating
the loss itself, regardless of the number of parameters.
Obtaining the same sort of information by either
forward-mode differentiation or brute force finite differences would
require a separate pass for each parameter and would make
deep learning entirely infeasible.

Applying RMD to hyperparameter optimization was
proposed by Bengio (2000) and Baydin & Pearlmutter (2014),
and applied to small problems by Domke (2012).
However, the naive approach fails for real-sized problems
because of memory constraints. RMD requires that
intermediate variables be maintained in memory for the
reverse pass. Evaluating the validation loss requires
training the model, which may require many elementary
iterations. Conventional RMD stores this entire training
trajectory, w_1, ..., w_T, in memory. In large neural networks, the
amount of memory required to store the millions of
parameters being trained is typically close to the amount of
physical RAM available (Sutskever et al., 2014). If storing
the parameter vector takes ~1GB, and the parameter vector
is updated tens of thousands of times (the number of
minibatches times the number of epochs) then storing the
learning history is unmanageable even with physical storage.

Imagine that we could exactly trace a training procedure
backwards, starting from the trained parameter values and
working back to the initial parameters. Then we could
recompute the learning trajectory on the fly during the reverse
pass of RMD rather than storing it in memory. This is not
possible in general, but we will show that for the popular
training procedure of stochastic gradient descent with
momentum, we can do exactly this, storing a small number of
auxiliary bits to handle finite precision arithmetic.

2.1. Reversible learning with exact arithmetic 

Stochastic gradient descent (SGD) with momentum
(Algorithm 1) can be seen as a physical simulation of a system
moving through a series of fixed force fields indexed by
time t. With exact arithmetic this procedure is reversible.
This lets us write Algorithm 2, which reverses the steps in
Algorithm 1, interleaved with computations of gradients.
It outputs the gradient of a function of the trained weights
f(w) (such as the validation loss) with respect to the initial
weights w_1, the learning-rate and momentum schedules,
and any other hyperparameters which affect training
gradients.


Algorithm 1 Stochastic gradient descent with momentum

1: input: initial w_1, decays γ, learning rates α, loss function L(w, θ, t)
2: initialize v_1 = 0
3: for t = 1 to T do
4:   g_t = ∇_w L(w_t, θ, t)                  ▷ evaluate gradient
5:   v_{t+1} = γ_t v_t - (1 - γ_t) g_t       ▷ update velocity
6:   w_{t+1} = w_t + α_t v_t                 ▷ update position
7: end for
8: output trained parameters w_T




Algorithm 2 Reverse-mode differentiation of SGD

 1: input: w_T, v_T, γ, α, train loss L(w, θ, t), loss f(w)
 2: initialize dv = 0, dθ = 0, dα_t = 0, dγ = 0
 3: initialize dw = ∇_w f(w_T)
 4: for t = T counting down to 1 do
 5:   dα_t = dw^T v_t
 6:   w_{t-1} = w_t - α_t v_t                   } exactly reverse
 7:   g_t = ∇_w L(w_t, θ, t)                    } gradient descent
 8:   v_{t-1} = [v_t + (1 - γ_t) g_t] / γ_t     } operations
 9:   dv = dv + α_t dw
10:   dγ_t = dv^T (v_t + g_t)
11:   dw = dw - (1 - γ_t) dv ∇_w ∇_w L(w_t, θ, t)
12:   dθ = dθ - (1 - γ_t) dv ∇_θ ∇_w L(w_t, θ, t)
13:   dv = γ_t dv
14: end for
15: output gradient of f(w_T) w.r.t. w_1, v_1, γ, α and θ


Computations of steps 11 and 12 both require a
Hessian-vector product, but these can be computed exactly by
applying RMD to the dot product of the gradient with a vector
(Pearlmutter, 1994). Thus the time complexity of reverse
SGD is O(T), the same as forward SGD.
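As a rough illustration of how the forward and reverse passes fit together, the following Python sketch runs both on a one-dimensional toy problem. Everything specific here is an assumption made for the example rather than the paper's implementation: the training loss is L(w) = 0.5 h w^2, so the Hessian is simply the scalar h, the final loss is f(w) = 0.5 (w - c)^2, exact rational arithmetic stands in for "exact arithmetic", and the position update uses the freshly updated velocity, which is what makes each step invertible in closed form. The function and variable names are hypothetical.

```python
from fractions import Fraction as F

def sgd_forward(w, alphas, gammas, h):
    """SGD with momentum on the toy training loss L(w) = 0.5 * h * w**2."""
    v = F(0)
    for a, gam in zip(alphas, gammas):
        g = h * w                       # evaluate gradient
        v = gam * v - (1 - gam) * g     # update velocity
        w = w + a * v                   # update position (assumed: new velocity)
    return w, v

def sgd_reverse(w, v, alphas, gammas, h, c):
    """Retrace training backwards, accumulating gradients of
    f(w) = 0.5 * (w - c)**2 w.r.t. the initial weight and both schedules."""
    T = len(alphas)
    d_alpha, d_gamma = [F(0)] * T, [F(0)] * T
    dw, dv = w - c, F(0)                # dw starts as f'(w_T)
    for t in reversed(range(T)):
        a, gam = alphas[t], gammas[t]
        d_alpha[t] = dw * v
        w = w - a * v                   # exactly undo the position update
        g = h * w
        v_prev = (v + (1 - gam) * g) / gam   # exactly undo the velocity update
        dv = dv + a * dw
        d_gamma[t] = dv * (v_prev + g)
        dw = dw - (1 - gam) * dv * h    # h plays the role of the Hessian
        dv = gam * dv
        v = v_prev
    return w, v, dw, d_alpha, d_gamma

alphas = [F(1, 10)] * 5                 # learning-rate schedule
gammas = [F(9, 10)] * 5                 # momentum-decay schedule
h, c, w1 = F(2), F(1, 2), F(3)
wT, vT = sgd_forward(w1, alphas, gammas, h)
w_rec, v_rec, d_w1, d_alpha, d_gamma = sgd_reverse(wT, vT, alphas, gammas, h, c)
```

With rational arithmetic the reverse pass lands exactly on the initial weight and zero velocity, and the returned hypergradients can be compared against finite differences of the forward pass. With a real network the scalar h would be replaced by a Hessian-vector product.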
2.2. Reversible learning with finite precision arithmetic

In practice, Algorithm 2 fails utterly due to finite numerical
precision. The problem is the momentum decay term γ.
Every time we apply step 8 to reduce the velocity, we lose
information. Assuming we are using a fixed-point
representation,² each multiplication by γ < 1 shifts bits to the
right, destroying the least significant bits. This is more than
a pedantic concern. Attempting to carry out the reverse
training requires repeated multiplication by 1/γ. Errors
accumulate exponentially, and the reversed learning
procedure ends far from the initial point (and usually overflows).
Do we need γ < 1? Unfortunately we do. γ > 1 results
in unstable dynamics, and γ = 1 recovers the leapfrog
integrator (Hut et al., 1995), a perfectly reversible set of
dynamics, but one that does not converge.
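A tiny fixed-point example makes the information loss concrete. The sketch below assumes velocities are stored as plain integers and that the decay 9/10 is applied with integer division. The helper names are hypothetical.

```python
def decay(v, n, d):
    """Multiply an integer (fixed-point) velocity by n/d, rounding down."""
    return v * n // d

def naive_undo(v, n, d):
    """Attempt to invert decay() by multiplying by d/n."""
    return v * d // n

a = decay(1000, 9, 10)
b = decay(1001, 9, 10)
# a == b: two distinct velocities collapse onto the same value,
# so no function of the result alone can tell them apart.
recovered = naive_undo(b, 9, 10)
# recovered is 1000, not the original 1001.
```

Each such step loses only a fraction of a bit, but the reverse pass multiplies the resulting error by 1/γ at every iteration.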

This problem is quite a deep one: optimization necessarily
discards information. Ideally, optimization maps all
initializations to the same optimum, a many-to-one mapping with
no hope of inversion. Put another way, optimization moves
a system from a high-entropy initial state to a low-entropy
(hopefully zero entropy) optimized final state.

It is interesting to consider the analogy with physical
dynamics. The γ term is analogous to a drag term in the
simulation of Hamiltonian dynamics. Having γ < 1
corresponds to dissipative dynamics which generates heat,
increases the entropy of the environment and is therefore
not reversible. But we must have dissipation in order for
our system to converge to equilibrium.

If we want to reverse the dynamics, there is no choice but
to store the extra bits discarded by the γ operation. But
we can at least try to be parsimonious about the number of
extra bits we store. This is what the next section addresses.

2.3. Optimal storage of discarded entropy 

This section gives the technical details of how to efficiently
store the information discarded each time the momentum
decay operation (Step 8) is applied.

If γ = 0.5, we can simply store the single bit that falls off
at each iteration, and if γ = 0.25 we could store two bits.
But for fine-grained control over γ we need a way to store
the information lost when we multiply by, say, γ = 0.9,
which will be less than one bit on average. Here we give a
procedure which achieves exactly this.

We represent the velocity v and parameter w vectors with
64-bit integers. With an implied radix point this can be
a fixed-point representation of the reals. We represent γ
as a rational number, n/d. When we divide each v by d
we use integer division. In order to be able to reverse the
process we just need to store the remainder, v modulo d,
in some “information buffer”, B. If B were an integer and
d = 2, the remainder r would just be a single bit, and
we could store it in B by left-shifting B’s bits and adding
r. For arbitrary d, we can do the base-d analogue of this
operation: multiply B by d and add r. Eventually, B will
overflow. We need a way to either detect this, store the bits,
and start a fresh integer, or else we can just use an arbitrary
size integer that grows as needed. (Python’s “long” integer
type supports this). This procedure allows division by d
while storing the remainder in log2(d) bits on average.

When we multiply by the numerator of n/d we don’t need
to store anything extra, since integer division will bring us
back to exactly the same point anyway. But the procedure
as it stands would store three bits when γ = 7/8, whereas
it should store less than one (log2(8/7) = 0.19). Our
solution is the following: when we multiply v by n, there is
an opportunity to add a nonnegative integer smaller than n
to the result without affecting the reverse process (integer
division by n). We can get such an integer from the
information buffer by dividing it by n and recording B modulo
n. We are using the velocity v as an information buffer
itself! Algorithm 3 illustrates the entire process.

²We assume fixed-point representation to simplify the
discussion (and the implementation). Courbariaux et al. (2014) show
that fixed-point arithmetic is sufficient to train deep networks.
Floating point representation doesn’t fix the problem, it just
defers the loss of information from the division step to the addition
step.


Algorithm 3 Exactly reversible multiplication by a ratio

Input: Information buffer i, value c, ratio n/d
  i = i × d              ▷ make room for new digit
  i = i + (c mod d)      ▷ store digit lost by division
  c = c ÷ d              ▷ divide by denominator
  c = c × n              ▷ multiply by numerator
  c = c + (i mod n)      ▷ add digit from buffer
  i = i ÷ n              ▷ shorten information buffer
return updated buffer i, updated value c
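The steps of Algorithm 3 translate almost line for line into Python, whose built-in integers grow as needed and so can serve as the information buffer. The sketch below is one possible rendering, not the paper's code; the function names are hypothetical, and it relies on Python's floor-division semantics for `//` and `%`.

```python
def forward(buf, c, n, d):
    """One reversible multiplication of the integer c by n/d."""
    buf = buf * d          # make room for new digit
    buf = buf + c % d      # store digit lost by division
    c = c // d             # divide by denominator
    c = c * n              # multiply by numerator
    c = c + buf % n        # add digit from buffer
    buf = buf // n         # shorten information buffer
    return buf, c

def backward(buf, c, n, d):
    """Undo forward(): the same six steps with n and d swapped."""
    return forward(buf, c, d, n)

buf, c = 0, 123456789
for _ in range(1000):                  # 1000 decays by 9/10
    buf, c = forward(buf, c, 9, 10)
stored_bits = buf.bit_length()         # size of the buffer after the forward pass
for _ in range(1000):                  # retrace all 1000 steps
    buf, c = backward(buf, c, 9, 10)
```

After the backward loop both the value and the buffer return to their starting state. The buffer stays far smaller than the 64 bits per step that storing every intermediate value would require.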


We could also have used an arithmetic coding scheme for
our information buffer (MacKay, 2003, Chapter 6). How
much does this procedure save us? When γ = 0.98, we
will have to store only 0.029 bits on average. Compared
to storing a new 32-bit integer or floating-point number at
each iteration, this reduces memory requirements by a
factor of one thousand.
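The figures quoted above follow from the fact that multiplying by γ discards log2(1/γ) bits per step on average. A short check, with hypothetical helper names and a 32-bit word assumed as the baseline:

```python
import math

def bits_per_step(gamma):
    """Average information discarded by one multiplication by gamma."""
    return math.log2(1.0 / gamma)

def memory_saving(gamma, word_bits=32):
    """Ratio of naive storage (one word per step) to the bits actually needed."""
    return word_bits / bits_per_step(gamma)

bits_098 = bits_per_step(0.98)      # roughly 0.029 bits
saving_098 = memory_saving(0.98)    # on the order of one thousand
saving_090 = memory_saving(0.9)     # on the order of two hundred
bits_7_8 = bits_per_step(7 / 8)     # roughly 0.19 bits
```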

The standard way to save memory in RMD is
checkpointing. Checkpointing stores the entire parameter
vector on only a fraction of the training steps, and recomputes
the missing steps of the training procedure (forwards) as
needed during the backward pass. However, this would
require too much memory to be practical for large neural nets
trained for thousands of minibatches.
3. Experiments

In typical machine learning applications, only a few
hyperparameters (fewer than 20) are optimized. Since each
experiment only yields a single number (the validation loss),
the search rapidly becomes more difficult as the
dimension of the hyperparameter vector increases. In contrast,
when hypergradients are available, the amount of
information gained from each training run grows along with the
number of hyperparameters, allowing us to optimize
thousands of hyperparameters. How can we take advantage of
this new ability?

This section shows several proof-of-concept experiments in
which we can more richly parameterize training and
regularization schemes in ways that would have been previously
impractical to optimize.

3.1. Gradient-based optimization of gradient-based 
optimization 

Modern neural net training procedures often employ
various heuristics to set learning rate schedules, or set their
shape using one or two hyperparameters set by
cross-validation (Dahl et al., 2014; Sutskever et al., 2013). These
schedule choices are supported by a mixture of intuition,
arguments about the shape of the objective function, and
empirical tuning.

To more directly shed light on good learning rate schedules,
we jointly optimized separate learning rates for every
single learning iteration of training of a deep neural network,
as well as separately for weights and biases in each layer.
Each meta-iteration trained a network for 100 iterations of
SGD, meaning that the learning rate schedules were
specified by 800 hyperparameters (100 iterations × 4 layers ×
2 types of parameters). To avoid learning an optimization
schedule that depended on the quirks of a particular random
initialization, each evaluation of hypergradients used a
different random seed. These random seeds were used both to
initialize network weights and to choose minibatches. The
network was trained on 10,000 examples of MNIST, and
had 4 layers, of sizes 784, 50, 50, and 50.

Because learning schedules can implicitly regularize
networks (Erhan et al., 2010), for example by enforcing early
stopping, for this experiment we optimized the learning rate
schedules on the training error rather than on the validation
set error. Figure 2 shows the results of optimizing learning
rate schedules separately for each layer of a deep neural
network. When Bayesian optimization was used to choose
a fixed learning rate for all layers and iterations, it chose a
learning rate of 2.4.

[Plot: Optimized learning rate schedule]

Figure 2. A learning-rate training schedule for the weights in each
layer of a neural network, optimized by hypergradient descent.
The optimized schedule starts by taking large steps only in the
topmost layer, then takes larger steps in the first layer. All layers
take smaller step sizes in the last 10 iterations. Not shown are
the schedules for the biases or the momentum, which showed less
structure.

[Plots: Elementary learning curves; Meta-learning curve]

Figure 3. Elementary and meta-learning curves. The
meta-learning curve shows the training loss at the end of each
elementary iteration.

Meta-optimization strategies We experimented with
several standard stochastic optimization methods for
meta-optimization, including SGD, RMSprop (Tieleman &
Hinton, 2012), and minibatch conjugate gradients. The results
in this section used Adam (Kingma & Ba, 2014), a variant
of RMSprop that includes momentum. We typically ran for
50 meta-iterations, and used a meta-step size of 0.04.
Figure 3 shows the elementary and meta-learning curves that
generated the hyperparameters shown in Figure 2.
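The outer loop described here can be sketched as a plain Adam update applied to a hypergradient. In the sketch below the 50 meta-iterations and the meta-step size of 0.04 come from the text, while the remaining Adam constants are the usual defaults and are an assumption. `toy_hypergrad` is a hypothetical stand-in for a full training run followed by Algorithm 2.

```python
import math

def adam_minimize(hypergrad, x, steps=50, step_size=0.04,
                  b1=0.9, b2=0.999, eps=1e-8):
    """Update the hyperparameter list x in place with Adam."""
    m = [0.0] * len(x)     # running mean of hypergradients
    s = [0.0] * len(x)     # running mean of squared hypergradients
    for t in range(1, steps + 1):
        g = hypergrad(x)
        for i, gi in enumerate(g):
            m[i] = b1 * m[i] + (1 - b1) * gi
            s[i] = b2 * s[i] + (1 - b2) * gi * gi
            m_hat = m[i] / (1 - b1 ** t)    # bias-corrected estimates
            s_hat = s[i] / (1 - b2 ** t)
            x[i] -= step_size * m_hat / (math.sqrt(s_hat) + eps)
    return x

def toy_hypergrad(x):
    # Placeholder: gradient of sum((x_i - 1)^2), standing in for the
    # hypergradient returned by reversing an entire training run.
    return [2.0 * (xi - 1.0) for xi in x]

result = adam_minimize(toy_hypergrad, [0.0, 0.0])
```

Because Adam normalizes each coordinate by its recent gradient magnitude, every hyperparameter moves by at most roughly the meta-step size per meta-iteration, which may help when hypergradients differ widely in scale.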

How smooth are hypergradients? To demonstrate that
the hypergradients are smooth with respect to time steps
in the training schedule, Figure 4 shows the hypergradient
with respect to the step size training schedule at the
beginning of training, averaged over 100 random seeds.

Optimizing weight initialization scales We optimized a 
separate weight initialization scale hyperparameter for each 
type of parameter (weights and biases) in each layer - a total 
of 8 hyperparameters. Results are shown in Figure 5. 

[Plot: Hypergradient at first meta-iteration; x-axis: Schedule index]

Figure 4. The initial gradient of the cross-validation loss with
respect to the training schedule, averaged over 100 random weight
initializations and minibatches. Colors correspond to the same
layers as in Figure 2.

[Plots: Biases; Weights; x-axis: Meta iteration]

Figure 5. Meta-learning curves for the initialization scales of each
layer in a 4-layer deep neural network. Left: Initialization scales
for biases. Right: Initialization scales for weights. Dashed lines
show a heuristic which gives an average total activation of 1. For
the first layer it is 1/√784 and for subsequent layers 1/√50.

Interestingly, the initialization scale chosen for the first
layer weights matches a heuristic which says to choose an
initialization scale of 1/√N, where N is the number of
weights in the layer.
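For the layer sizes used in this experiment the heuristic gives the following values. The sketch takes N to be the number of inputs to the layer (784 for the first layer and 50 for the later ones), which is how the Figure 5 caption applies it. The function name is hypothetical.

```python
import math

def init_scale(n_inputs):
    """Heuristic initialization scale 1/sqrt(N)."""
    return 1.0 / math.sqrt(n_inputs)

first_layer = init_scale(784)    # 1/28, about 0.036
later_layers = init_scale(50)    # about 0.141
```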

3.2. Optimizing regularization parameters 

Regularization is often important for generalization
performance. Typically, a single parameter controls a single
L2 norm or sparsity penalty on the entire parameter vector
of a neural network. Because different types of
parameters in different layers play different roles, it is reasonable
to suspect that a separate regularization hyperparameter for
each parameter type would improve performance. Indeed,
Snoek et al. (2012) optimized separate regularization
parameters for each layer in a neural network, and found that
it improved performance.

Figure 6. Optimized L2 regularization hyperparameters for each
weight in a logistic regression trained on MNIST. The weights
corresponding to each output label (0 through 9 respectively) have
been rendered separately. High values (black) indicate strong
regularization.

We can take this idea even further, and introduce a
separate regularization penalty for each individual parameter in
a neural network. We use a simple model as an example -
logistic regression, which can be seen as a neural network
without a hidden layer. We choose this model because
every weight corresponds to an input-pixel and output-label
pair, meaning that these 7,840 hyperparameters might be
relatively interpretable. Figure 6 shows a set of
regularization hyperparameters learned for a logistic regression
network. Because each parameter corresponds to a particular
input, this regularization scheme could be seen as a
generalization of automatic relevance determination (MacKay &
Neal, 1994).
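A per-weight penalty also makes the cross-derivative needed in step 12 of Algorithm 2 easy to write down. The sketch below assumes the penalty has the form sum_i θ_i w_i^2, with one nonnegative θ_i per weight. The paper may parameterize it differently (for example on a log scale), and the function names are hypothetical.

```python
def l2_penalty(w, theta):
    """Per-weight L2 penalty: sum_i theta_i * w_i**2."""
    return sum(t * x * x for t, x in zip(theta, w))

def penalty_grad_w(w, theta):
    """Gradient of the penalty with respect to the weights."""
    return [2.0 * t * x for t, x in zip(theta, w)]

def mixed_product(dv, w):
    """dv^T (d/dtheta d/dw penalty). The cross-derivative matrix is
    diagonal with entries 2 * w_i, so the product is elementwise."""
    return [2.0 * d * x for d, x in zip(dv, w)]

w = [1.0, 2.0]
theta = [0.5, 0.25]
dv = [1.0, 1.0]
penalty = l2_penalty(w, theta)
grad = penalty_grad_w(w, theta)
mixed = mixed_product(dv, w)
```

Because the cross-derivative is diagonal, accumulating the hypergradient for thousands of such hyperparameters costs only one elementwise product per training step.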

3.3. Optimizing training data 

We can use Algorithm 2 to take the gradient with respect
to any parameter the training procedure depends on. This
includes the training data, which can be viewed as just
another set of hyperparameters. By chaining gradients
through transformations of the data, we can compute
gradients of the validation objective with respect to data
preprocessing, weighting, or augmentation procedures.

We demonstrate a simple proof-of-concept where an entire
training set is learned by gradient descent, starting from
blank images. Figure 7 shows a training set, the pixels of
which were optimized to improve performance on a
validation set of 10,000 examples from MNIST. We optimized
10 training examples, each having a different fixed label,
again from 0 to 9 respectively. Learning the labels of a
larger training set might shed light on which classes are
difficult to distinguish and so require more examples.



Figure 7. A dataset generated purely through meta-learning. Each 
pixel is treated as a hyperparameter, which are all optimized to 
maximize validation-set performance. Training labels are fixed in 
order from 0 to 9. Some optimal pixel values are negative. 
3.4. Optimizing initial parameters

The last remaining parameter to SGD is the initial
parameter vector. Treating this vector as a hyperparameter blurs
the distinction between learning and meta-learning. In the
extreme case where all elementary learning rates are set to
zero, the training set ceases to matter and the meta-learning
procedure exactly reduces to elementary learning on the
validation set. Due to philosophical vertigo, we chose not
to optimize the initial parameter vector.

3.5. Learning continuously parameterized architectures

Many of the notable successes in deep learning have come
from novel architectures adapted to particular domains:
convolutional neural nets, recurrent neural nets and
multitask neural nets. We can think of these architectures as
hard constraints that force particular weights to be zero and
tie particular pairs of weights together. By softening these
hard architectural constraints we can form continuous (but
very high-dimensional) parameterizations of architecture.
Having access to hypergradients makes learning these
softened architectures feasible.

We illustrate this “architecture learning” with a multitask
learning problem, the Omniglot data set (Lake, 2014). This
data set consists of 28x28 pixel greyscale images of
characters from 50 alphabets with up to 55 characters in each
alphabet but only 15 examples of each character. Rather
than learning a separate neural net for each alphabet, a
multitask approach would be for all the neural nets to share a
single first layer, pooling statistical strength to learn generic
Gabor-like filters, while maintaining separate higher layers
specific to each alphabet.

We can parameterize any architecture based on weight
tying or weight absence with a pairwise quadratic penalty on
the weights, w^T A w, where A is a number-of-weights by
number-of-weights matrix. Learning this enormous matrix
is clearly infeasible but we can implicitly build such a
matrix from lower dimensional structures of manageable size.
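One way such an implicit construction could look is sketched below: a small task-by-task matrix for a single layer is expanded into the quadratic form by summing inner products between the tasks' weight vectors, so the full number-of-weights matrix is never formed. This is an assumption made for illustration and not necessarily the paper's construction. The function names and the hand-picked 2×2 matrices are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sharing_penalty(task_weights, A):
    """w^T A w for one layer, where A[a][b] couples the weight vectors
    of tasks a and b: sum_ab A[a][b] * <w_a, w_b>."""
    K = len(task_weights)
    return sum(A[a][b] * dot(task_weights[a], task_weights[b])
               for a in range(K) for b in range(K))

tied = [[1, -1], [-1, 1]]          # equals ||w_a - w_b||^2: pulls weights together
independent = [[1, 0], [0, 1]]     # ordinary L2 penalty on each task separately

same = [[1.0, 2.0], [1.0, 2.0]]
different = [[1.0, 2.0], [3.0, 5.0]]

p_tied_same = sharing_penalty(same, tied)            # no cost when weights agree
p_tied_diff = sharing_penalty(different, tied)       # squared distance between tasks
p_indep_diff = sharing_penalty(different, independent)
```

Making the entries of the small matrix hyperparameters then lets the degree of sharing between any pair of tasks vary continuously between these two extremes.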

For the Omniglot problem, we learn a penalty for each
alphabet pair, separately for each neural net layer. Thus,
for ten three-layer neural networks, the penalty matrix A
is fully described by three ten-by-ten matrices. An
architecture with fully independent nets for each alphabet
corresponds to three diagonal matrices while an architecture
with a mutual lower layer corresponds to two diagonal
matrices for the upper layers and a matrix of all ones for the
lowest layer (Figure 9).

We use five alphabets from the Omniglot set. To see
whether our multitask learning system is able to learn high
level similarities as well as low-level similarities, we repeat
these five alphabets with the images rotated by 90 degrees
(Figure 8) to make ten alphabets total.

Figure 8. Top: Example characters from 5 alphabets taken from
the Omniglot dataset. Bottom: Those same alphabets with each
character rotated by 90°. Distinguishing characters within each of
these 10 alphabets constitute the 10 tasks in our multi-task
learning experiment.

Figure 9 shows the learned penalties (normalized by row
and column to have ones on the diagonal, akin to a
correlation matrix). We see that the lowest layer has been partially
shared, across all alphabets equally, with the upper layers
much less shared. Interestingly, the top layer penalty learns
to share weights between the rotated alphabets.
Figure 9. Results of the Omniglot multitask experiment. Each
matrix shows the degree of weight sharing between each pair of
tasks for that layer. Top: A separate network is trained
independently for each task. Middle: The lowest-level features were
forced to be shared. Bottom: The degree of weight sharing
between tasks was optimized by hyperparameter optimization.


3.6. Implementation Details 

Automatic differentiation (AD) software packages such
as Theano (Bastien et al., 2012; Bergstra et al., 2010)
are mainstays of deep learning, significantly speeding up
development time by providing gradients automatically.
Since we required access to the internal logic of RMD in
order to implement Algorithm 2, we implemented our own
automatic differentiation package for Python, available at
github.com/HIPS/autograd. This package
differentiates standard Numpy (Oliphant, 2007) code, and can
differentiate code containing while loops, branches, and
even gradient evaluations.

Code for all experiments in this paper is available at
github.com/HIPS/hypergrad.


4. Limitations 

Back-propagation for training neural networks has several
pitfalls that were later addressed by analysis and
engineering. Likewise, the use of hypergradients also has several
apparent difficulties that need to be addressed before it
becomes practical. This section explores several issues with
this technique that became apparent in our experiments.

When are gradients meaningful? Bengio et al. (1994)
noted that “learning long-term dependencies with gradient
descent is difficult.” Our situation is even worse: we are
using gradients to optimize functions which depend on their
hyperparameters through hundreds of iterations of SGD.
To make things worse, each elementary iteration’s
gradient itself depends on forward- and then back-propagation
through a neural network. Thus the same issues that
sometimes make elementary learning difficult are compounded.

For example, Pearlmutter (1996, Chapter 4) showed that
large learning rates induce chaotic behavior in the
learning dynamics, making the gradient uninformative about the
medium-term shape of the training objective. This
phenomenon is related to the exploding-gradient problem
(Pascanu et al., 2012).

Figure 10 illustrates this phenomenon when training a
neural network having 2 hidden layers for 50 elementary
iterations. We partially addressed this problem in our
experiments by initializing learning rates to be relatively small,
and stopping meta-optimization when the magnitude of the
meta-gradient began to grow.

Overfitting How many hyperparameters can we
fruitfully optimize? One limitation is overfitting the validation
objective, in the same way that optimizing too many
parameters can overfit the training objective. However, the
same rules of thumb still apply - the size of the validation
set, assuming examples are i.i.d., gives a rough guide to
how many hyperparameters can be optimized.

Discrete parameters Of course, gradients are not
necessarily useful for optimizing discrete hyperparameters such
as the number of layers, or hyperparameters that affect
discrete changes such as dropout regularization parameters.
Some of these difficulties could be addressed by
parameterizing apparently discrete choices in a continuous
manner. For instance, the per-hidden-unit regularization of
section 3.2 is an example of a continuous way to choose the
number of hidden units.

[Plots, x-axis: Log learning rate]

Figure 10. Top: Loss after training as a function of learning rate.
Bottom: Gradient of loss with respect to learning rate. When the
learning rate is high, the gradient becomes uninformative about
the medium-term behavior of the function. To maintain stability
during meta-learning, we initialize using a small learning rate so
as to approach the minimum from the left.


5. Related work 

The most closely-related work is Domke (2012), who
derived algorithms to compute reverse-mode derivatives
of gradient descent with momentum and L-BFGS, using
them to update the hyperparameters of CRF image models.
However, his approach relied on naive caching of all
parameter vectors w_1, w_2, ..., w_T, making it impractical for
large models with many training iterations.

Larsen et al. (1998), Eigenmann & Nossek (1999), Chen
& Hagan (1999), Bengio (2000), Abdel-Gawad & Ratner
(2007), and Foo et al. (2008) showed that gradients of
regularization parameters are available in closed form when
training has converged exactly to a local minimum. In
contrast, our procedure can compute exact gradients of any
type of hyperparameter, whether or not learning has
converged.


Support vector machines Chapelle et al. (2002)
introduced a differentiable bound on the SVM loss in order to
be able to compute derivatives with respect to hundreds of
hyperparameters, including weighting parameters for each
input dimension in the kernel. However, this bound was
not tight, since optimizing the SVM objective requires a
discrete selection of training points.
Bayesian methods For Bayesian models with a
closed-form marginal likelihood, gradients with respect to all
continuous hyperparameters are usually available. For
example, this ability has been used to construct
complex kernels for Gaussian process models (Rasmussen
& Williams, 2006, Chapter 5). Variational inference
also allows gradient-based tuning of hyperparameters in
Bayesian neural-network models such as deep Gaussian
processes (Hensman & Lawrence, 2014). However, it does
not provide gradients with respect to training parameters.

Gradients with respect to Markov chain parameters 

Salimans et al. (2014) tune the step-size and mass-matrix
parameters of Hamiltonian Monte Carlo by chaining
gradients from a lower bound on the marginal likelihood through
several iterations of leapfrog dynamics. Because they used
only a small number of steps, all intermediate values could
be stored naively. Our reversible-dynamics memory-tape
approach could be used to dramatically extend the number
of HMC iterations used in this approach.

6. Extensions and future work 

Bayesian optimization with gradients Hypergradients
could be used with parallel, model-based optimization of
hyperparameters. For example, Gaussian-process-based
optimization methods could incorporate gradient information
(Solak et al., 2003). Such methods could make use of
parallel evaluations of hypergradients, which might be too
slow to evaluate in a sequential manner.

Reversible elementary computation Recurrent neural
network models can require so much memory to differentiate
that checkpointing is required simply to compute their
elementary gradients (Martens & Sutskever, 2012).
Reversible computation might offer memory savings for some
architectures. For example, evaluations of Long Short-Term
Memory (Hochreiter & Schmidhuber, 1997) or
Neural Turing Machines (Graves et al., 2014) rely on long
chains of mostly-small updates of parameters. Exactly
reversing these dynamics might allow more memory-efficient
elementary gradient evaluations of their outputs on very
long input sequences.

Exactly reversing other learning methods The memory
saving trick from Section 2.3 could presumably be applied
to other momentum-based variants of SGD such as RMSprop
(Tieleman & Hinton, 2012) or Adam (Kingma & Ba,
2014).
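
As a point of reference, the kind of reversal that such an extension would build on can be sketched for SGD with momentum. This is a minimal illustration under stated assumptions, not the paper's implementation: the update rule `v <- gamma*v - (1-gamma)*g`, `w <- w + alpha*v`, the toy quadratic loss, and all names (`CURVATURE`, `forward_step`, `reverse_step`) are hypothetical choices. In ordinary floating point the reversal is only approximate, because multiplying by `gamma < 1` discards low-order bits; the sketch does not attempt to store those bits.

```python
CURVATURE = [1.0, 2.0, 0.5]  # assumed toy loss: L(w) = 0.5 * sum(c_i * w_i**2)

def grad(w):
    """Gradient of the assumed quadratic loss."""
    return [c * wi for c, wi in zip(CURVATURE, w)]

def forward_step(w, v, alpha, gamma):
    """One step of SGD with momentum (assumed form of the update)."""
    g = grad(w)
    v_next = [gamma * vi - (1.0 - gamma) * gi for vi, gi in zip(v, g)]
    w_next = [wi + alpha * vi for wi, vi in zip(w, v_next)]
    return w_next, v_next

def reverse_step(w_next, v_next, alpha, gamma):
    """Invert forward_step algebraically: recover (w, v) from (w_next, v_next)."""
    w = [wi - alpha * vi for wi, vi in zip(w_next, v_next)]
    g = grad(w)  # same gradient the forward step used, recomputed at the recovered w
    v = [(vi + (1.0 - gamma) * gi) / gamma for vi, gi in zip(v_next, g)]
    return w, v

alpha, gamma, T = 0.1, 0.9, 50
w0, v0 = [1.0, -2.0, 0.5], [0.0, 0.0, 0.0]

# Forward pass: train for T steps without storing any intermediate state.
w, v = w0, v0
for _ in range(T):
    w, v = forward_step(w, v, alpha, gamma)
w_final = list(w)

# Reverse pass: walk the trajectory backwards from the final state alone.
for _ in range(T):
    w, v = reverse_step(w, v, alpha, gamma)

# Remaining discrepancy is floating-point round-off amplified by the 1/gamma division.
max_error = max(abs(a - b) for a, b in zip(w + v, w0 + v0))
```

The division by `gamma` in `reverse_step` is where round-off is amplified, so the recovered state drifts as the number of steps grows. A version for RMSprop or Adam would also have to reverse their running averages of squared gradients, which is not attempted here.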

7. Conclusion 

In this paper, we derived a computationally efficient
procedure for computing gradients through stochastic gradient
descent with momentum. We showed how the approximate
reversibility of learning dynamics can be used to drastically
reduce the memory requirement for exactly backpropagating
gradients through hundreds of training iterations.

We showed how these gradients allow the optimization of
validation loss with respect to thousands of hyperparameters,
something which was previously infeasible. This
new ability allows the automatic tuning of most details of
training neural networks. We demonstrated the tuning of
detailed training schedules, regularization schedules, and
neural network architectures.

Appendix: Forward vs. reverse-mode differentiation

By the chain rule, the gradient of a set of nested functions 
is given by the product of the individual derivatives of each 
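function:

$$\frac{\partial f_4(f_3(f_2(f_1(x))))}{\partial x} = \frac{\partial f_4}{\partial f_3} \frac{\partial f_3}{\partial f_2} \frac{\partial f_2}{\partial f_1} \frac{\partial f_1}{\partial x}$$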

If each function has multivariate inputs and outputs, the 
gradients are Jacobian matrices. 

Forward and reverse mode differentiation differ only by the 
order in which they evaluate this product. Forward-mode 
differentiation works by multiplying gradients in the same 
order as the functions are evaluated: 

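$$\frac{\partial f_4(f_3(f_2(f_1(x))))}{\partial x} = \frac{\partial f_4}{\partial f_3} \left( \frac{\partial f_3}{\partial f_2} \left( \frac{\partial f_2}{\partial f_1} \frac{\partial f_1}{\partial x} \right) \right)$$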

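Reverse-mode multiplies the gradients in the opposite
order, starting from the final result:

$$\frac{\partial f_4(f_3(f_2(f_1(x))))}{\partial x} = \left( \left( \frac{\partial f_4}{\partial f_3} \frac{\partial f_3}{\partial f_2} \right) \frac{\partial f_2}{\partial f_1} \right) \frac{\partial f_1}{\partial x}$$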


In an optimization setting, the final result of the nested
functions, $f_4$, is a scalar, while the input $x$ and intermediate
values, $f_1$ to $f_3$, can be vectors. In this scenario the
advantage of reverse-mode differentiation is very clear. Let’s
imagine that the dimensionality of all the intermediate vectors
is $D$. In reverse mode, we start from the (scalar) output,
and multiply by the next $D \times D$ Jacobian at each step. The
value we accumulate is just a $D$-dimensional vector. In forward
mode, however, we must accumulate an entire $D \times D$
matrix at each step. But do we still have to compute
and instantiate the $D \times D$ Jacobian matrices themselves
either way? In general, yes. But in the (common) case that
the vector-to-vector functions are either elementwise operations
or (reshaped) matrix multiplications, the Jacobian
matrices can actually be very sparse, and multiplication by
the Jacobian can be performed efficiently without instantiation
(Pearlmutter & Siskind, 2008).
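
The difference in accumulation cost can be made concrete with a small sketch. It assumes dense, explicitly instantiated Jacobians filled with random numbers, which is the general (worst) case rather than the sparse case just described; the helper names `matmul`, `forward_mode` and `reverse_mode` are hypothetical and only count scalar multiplications for the two orderings of the same product.

```python
import random

def matmul(a, b):
    """Multiply list-of-rows matrices; also return the number of scalar multiplications."""
    rows, inner, cols = len(a), len(b), len(b[0])
    product = [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
               for i in range(rows)]
    return product, rows * inner * cols

def forward_mode(jacobians):
    """Evaluate J4 (J3 (J2 J1)): start at the input and carry a D x D matrix."""
    acc, cost = jacobians[0], 0
    for jac in jacobians[1:]:
        acc, n = matmul(jac, acc)
        cost += n
    return acc, cost

def reverse_mode(jacobians):
    """Evaluate ((J4 J3) J2) J1: start at the scalar output and carry a 1 x D row vector."""
    acc, cost = jacobians[-1], 0
    for jac in reversed(jacobians[:-1]):
        acc, n = matmul(acc, jac)
        cost += n
    return acc, cost

def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
D = 4
# J1, J2, J3 are D x D; J4 is 1 x D because f4 is scalar-valued.
jacobians = [random_matrix(D, D) for _ in range(3)] + [random_matrix(1, D)]

grad_fwd, cost_fwd = forward_mode(jacobians)  # 2*D**3 + D**2 multiplications
grad_rev, cost_rev = reverse_mode(jacobians)  # 3*D**2 multiplications
```

Both orderings return the same gradient up to round-off. With D = 4 the forward ordering uses 144 multiplications and the reverse ordering 48, and the gap widens with D because one grows as D^3 and the other as D^2.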

The main drawback of reverse-mode differentiation is that
intermediate values must be maintained in memory during
the forward pass. In Sections 2.1 and 2.3, we show how
to drastically reduce the memory requirements of reverse-mode
differentiation when differentiating through the entire
learning procedure.